1 About

1.1 Contributions

Please note that authorship is alphabetical. Contributions are listed below - see github for details and who to blame for what :-).

1.3 Making comments

Free and easy to do. Raise an issue at:

2 Intro

Exploring covid 19 and related data.

This tweet:

Suggested something strange was happening with overall USA deaths in early 2020.:

USA deaths plot

USA deaths plot

This very brief analysis attempts to re-create some of their results from the same data.

3 Data

Sourced from the downloads button of https://gis.cdc.gov/grasp/fluview/mortality.html

“The National Center for Health Statistics (NCHS) collects and disseminates the Nation’s official vital statistics. NCHS collects death certificate data from state vital statistics offices for virtually all deaths occurring in the United States. Pneumonia and influenza (P&I) deaths are identified based on ICD-10 multiple cause of death codes.”

dt <- data.table::fread(paste0(myParams$dPath, "/USA-all-data.csv"))

t <- table(dt$SEASON)
kableExtra::kable(t, caption = "Seasons used") %>% kable_styling()
Table 3.1: Seasons used
Var1 Freq
2015-16 52
2016-17 52
2017-18 52
2018-19 52
2019-20 24

Note that the 2020 data is incomplete.

Before we go further we’ll convert week numbers (in the data) to standardised dates. This will mean some of the dates for the weeks are not exactly right but it might make for easier plotting.

# why can't they just be published with dates??

dt[, `:=`(date, ifelse(SEASON == "2015-16" & WEEK > 39, as.Date("2015-01-01") + lubridate::weeks(WEEK), 
    NA))]
dt[, `:=`(date, ifelse(SEASON == "2015-16" & WEEK < 40, as.Date("2016-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, ifelse(SEASON == "2016-17" & WEEK > 39, as.Date("2016-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, ifelse(SEASON == "2016-17" & WEEK < 40, as.Date("2017-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, ifelse(SEASON == "2017-18" & WEEK > 39, as.Date("2017-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, ifelse(SEASON == "2017-18" & WEEK < 40, as.Date("2018-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, ifelse(SEASON == "2018-19" & WEEK > 39, as.Date("2018-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, ifelse(SEASON == "2018-19" & WEEK < 40, as.Date("2019-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, ifelse(SEASON == "2019-20" & WEEK > 39, as.Date("2019-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, ifelse(SEASON == "2019-20" & WEEK < 40, as.Date("2020-01-01") + lubridate::weeks(WEEK), 
    date))]
dt[, `:=`(date, lubridate::as_date(date))]

# check coding head(dt[WEEK == 1])
t <- head(dt[WEEK == 1 | WEEK == 52, .(SEASON, WEEK, date)])

kableExtra::kable(t, captopn = "Check we coded the weeks correctly") %>% kable_styling()
SEASON WEEK date
2019-20 52 2019-12-31
2019-20 1 2020-01-08
2018-19 52 2018-12-31
2018-19 1 2019-01-08
2017-18 52 2017-12-31
2017-18 1 2018-01-08
# also clean up data where needed
dt[, `:=`(totDeaths, as.numeric(gsub(",", "", `TOTAL DEATHS`)))]
dt[, `:=`(pneuDeaths, as.numeric(gsub(",", "", `NUM PNEUMONIA DEATHS`)))]
dt[, `:=`(fluDeaths, as.numeric(gsub(",", "", `NUM INFLUENZA DEATHS`)))]

dt$V13 <- NULL
dt$V14 <- NULL
dt$V15 <- NULL
dt$V16 <- NULL
dt$V17 <- NULL
dt$V18 <- NULL

skimr::skim(dt)
Table 3.2: Data summary
Name dt
Number of rows 232
Number of columns 16
_______________________
Column type frequency:
character 7
Date 1
logical 1
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
AREA 0 1 8 8 0 1 0
AGE GROUP 0 1 3 3 0 1 0
SEASON 0 1 7 7 0 5 0
NUM INFLUENZA DEATHS 0 1 1 5 0 135 0
NUM PNEUMONIA DEATHS 0 1 5 5 0 220 0
TOTAL DEATHS 0 1 6 6 0 228 0
PERCENT COMPLETE 0 1 6 6 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2015-10-08 2020-03-18 2017-12-27 232

Variable type: logical

skim_variable n_missing complete_rate mean count
SUB AREA 232 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
WEEK 0 1 26.62 15.67 1.0 12.00 27.0 41.00 52.0 ▇▆▆▆▇
THRESHOLD 0 1 6.78 0.78 5.4 6.07 6.8 7.40 8.2 ▆▆▅▇▅
BASELINE 0 1 6.41 0.77 5.1 5.70 6.4 7.00 7.8 ▇▆▇▇▆
PERCENT P&I 0 1 6.56 1.12 4.9 5.70 6.2 7.32 10.9 ▇▅▃▁▁
totDeaths 0 1 53688.44 3467.98 37641.0 51073.50 53119.5 55809.75 67495.0 ▁▁▇▃▁
pneuDeaths 0 1 3380.11 591.94 2431.0 2924.50 3214.5 3782.50 5583.0 ▇▇▅▁▁
fluDeaths 0 1 167.95 275.90 2.0 13.75 31.5 257.50 1626.0 ▇▂▁▁▁
tdt <- dt[`PERCENT COMPLETE` != "> 100%"]
kableExtra::kable(table(tdt$date, tdt$`PERCENT COMPLETE`), caption = "Incomplete data...") %>% 
    kable_styling()
Table 3.2: Incomplete data…
79.10%
2020-03-18 1
dt <- dt[`PERCENT COMPLETE` == "> 100%"]

We remove the incomplete data from further analysis.

Check on total number of deaths per season.

t <- dt[, .(Total = dkUtils::tidyNum(sum(totDeaths)), Pneumonia = dkUtils::tidyNum(sum(pneuDeaths)), 
    Influenza = dkUtils::tidyNum(sum(fluDeaths))), keyby = .(SEASON)]
kableExtra::kable(t, caption = "Sum of deaths by season - do these look sensible?", 
    ) %>% kable_styling()
Table 3.3: Sum of deaths by season - do these look sensible?
SEASON Total Pneumonia Influenza
2015-16 2,697,072 178,002 3,448
2016-17 2,790,317 179,621 6,954
2017-18 2,835,739 180,132 15,620
2018-19 2,829,403 168,422 7,171
2019-20 1,265,545 75,578 5,406

4 All deaths (all ages)

Figure 4.1 is the one causing the interest. Note that we have set the y axis to start at zero to avoid over-emphasising the slope. Week 10 is roughly:

dt[WEEK == 10, .(SEASON, WEEK, date)]
##     SEASON WEEK       date
## 1: 2019-20   10 2020-03-11
## 2: 2018-19   10 2019-03-12
## 3: 2017-18   10 2018-03-12
## 4: 2016-17   10 2017-03-12
## 5: 2015-16   10 2016-03-11

Note that there are potential issues with the data. See Section 7.

week10Text <- paste0("Week 10: ", dt[WEEK == 10 & SEASON == "2019-20", (date)])
addWeek10Label <- function(p) {
    p <- p + annotate("text", x = 10.5, y = yMax * 0.4, angle = 10, size = 3, label = week10Text, 
        hjust = 0)
    p <- p + geom_vline(xintercept = 10, colour = "#CC79A7", alpha = myParams$vLineAlpha)
    return(p)
}

p <- ggplot2::ggplot(dt, aes(y = totDeaths, x = WEEK, colour = SEASON, group = SEASON)) + 
    geom_point() + ylim(0, NA) + labs(x = "Week", y = "Total deaths", caption = "Plot by @dataknut using data from https://gis.cdc.gov/grasp/fluview/mortality.html\nIncomplete data excluded")
yMax <- max(dt$totDeaths)
p <- addWeek10Label(p)
p
USA recorded deaths by flu season (all deaths, all ages)

Figure 4.1: USA recorded deaths by flu season (all deaths, all ages)

Figure 4.2: interactive version (hover the mouse over a dot).

plotly::ggplotly(p)

Figure 4.2: USA recorded deaths by flu season (all deaths)

5 Deaths from pneumonia (all ages)

Figure 5.1 is pneumonia deaths. They are trending downwards too.

p <- ggplot2::ggplot(dt, aes(y = pneuDeaths, x = WEEK, colour = SEASON, group = SEASON)) + 
    geom_point() + ylim(0, NA) + labs(x = "Week", y = "Pneumonia deaths", caption = "Plot by @dataknut using data from https://gis.cdc.gov/grasp/fluview/mortality.html")
yMax <- max(dt$pneuDeaths)
p <- addWeek10Label(p)
p
USA recorded Pneumonia deaths by flu season, all ages

Figure 5.1: USA recorded Pneumonia deaths by flu season, all ages

The ‘missing’ deaths here don’t seem to account for the overall total deaths reduction. It seems that pneumonia deaths are not being re-classed as covid since USA covid deaths are not recorded as starting until after March 10th. Although it is possible that earlier covid deaths were not recorded as covid or pneuomina (or influenza).

6 Deaths from influenza (all ages)

Figure 6.1 is influenza deaths.

p <- ggplot2::ggplot(dt, aes(y = fluDeaths, x = WEEK, colour = SEASON, group = SEASON)) + 
    geom_point() + ylim(0, NA) + labs(x = "Week", y = "Influenza deaths", caption = "Plot by @dataknut using data from https://gis.cdc.gov/grasp/fluview/mortality.html")
yMax <- max(dt$fluDeaths)
p <- addWeek10Label(p)
p
USA recorded Influenza deaths by flu season, all ages

Figure 6.1: USA recorded Influenza deaths by flu season, all ages

Yep, 2017-2018 was a bad flu year in the USA

7 Conclusions

The decline in 2020 ‘all deaths’ appears real provided the 2020 data is complete. For more on this question, see:

We can speculate about why and there are many suggestions in the thread sparked by the tweet:

8 Runtime

Report generated using knitr in RStudio with R version 3.6.3 (2020-02-29) running on x86_64-apple-darwin15.6.0 (Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64).

t <- proc.time() - myParams$startTime

elapsed <- t[[3]]

Analysis completed in 4.358 seconds ( 0.07 minutes).

R packages used:

  • data.table - (Dowle et al. 2015)
  • ggplot2 - (Wickham 2009)
  • here - (Müller 2017)
  • kableExtra - (Zhu 2018)
  • lubridate - (Grolemund and Wickham 2011)
  • plotly - (Sievert et al. 2016)
  • skimr - (Arino de la Rubia et al. 2017)

References

Arino de la Rubia, Eduardo, Hao Zhu, Shannon Ellis, Elin Waring, and Michael Quinn. 2017. Skimr: Skimr. https://github.com/ropenscilabs/skimr.

Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.

Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, and Pedro Despouy. 2016. Plotly: Create Interactive Web Graphics via ’Plotly.js’. https://CRAN.R-project.org/package=plotly.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Zhu, Hao. 2018. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.